Page Layout Classification Technique for Biomedical Documents
Abstract
The structural layout information of scanned document pages is valuable for a wide range of document processing applications, such as automatic document searching, document delivery, and automated data entry. This paper describes the classification of scanned document pages into different classes of physical layout structures. The proposed page layout classification technique uses a combination of geometry-based and content-based zone features calculated from optical character recognition (OCR) output. Geometry-based features are derived from geometric zone information, and content-based features from zone contents. A new feature called “single and multiple column zone vertical area string pattern” is also proposed to normalize document page images. After normalization, a template matching algorithm calculates similarity classification features by matching the vertical area string patterns of document pages to those of predefined document layout structures. The similarity classification features and both the geometry-based and content-based zone features are then input into a rule-based learning system, which makes the final decision on the page layout class. The performance of the page layout classification scheme has been evaluated on a sample of several hundred images of biomedical journal pages. Preliminary evaluation results show that the approach can classify journal pages into different classes of physical layout structures with an accuracy of more than 96%.
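The abstract does not define the vertical area string pattern or the template matching step in detail, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes each OCR zone is a hypothetical `(top, bottom, columns)` tuple, encodes the page as a fixed-length string with one symbol per vertical slot (`S` for a single-column zone, `M` for a multiple-column zone, `-` for empty space), and uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure. The function names, symbols, and templates are all illustrative.

```python
from difflib import SequenceMatcher


def vertical_area_string(zones, page_height, slots=20):
    """Encode a page as a fixed-length string, one symbol per vertical slot.

    zones: list of (top, bottom, columns) tuples (hypothetical zone format).
    Using a fixed number of slots normalizes pages of different heights.
    """
    symbols = []
    for i in range(slots):
        # Sample the page at the vertical midpoint of each slot.
        mid = (i + 0.5) * page_height / slots
        symbol = "-"  # assumption: '-' marks a slot not covered by any zone
        for top, bottom, columns in zones:
            if top <= mid < bottom:
                symbol = "S" if columns == 1 else "M"
                break
        symbols.append(symbol)
    return "".join(symbols)


def similarity_features(pattern, templates):
    """Return one similarity score in [0, 1] per predefined layout template."""
    return {
        name: SequenceMatcher(None, pattern, template).ratio()
        for name, template in templates.items()
    }


# Illustrative page: a single-column header block above a two-column body.
zones = [(0, 300, 1), (300, 1000, 2)]
pattern = vertical_area_string(zones, page_height=1000, slots=10)

# Illustrative templates, not taken from the paper.
templates = {"title_page": "SSSMMMMMMM", "body_page": "MMMMMMMMMM"}
scores = similarity_features(pattern, templates)
print(pattern, scores)
```

In a pipeline like the one described, scores of this kind would be passed, together with the geometry-based and content-based zone features, to the rule-based learning system that makes the final decision.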
Similar resources
Document page similarity based on layout visual saliency: Application to query by example and document classification
In this paper we propose to define a measure of visual similarity to compare different pages in a corpus. This measure is based on the analysis of the visual layout saliency of the page composition. This similarity is computed using both the document layout and characteristics of the text itself. The text characterization uses statistical features derived from textural primitives. Our purpose i...
An Algorithm for Finding Maximal Whitespace Rectangles at Arbitrary Orientations for Document Layout Analysis
The analysis of the background structure (whitespace) of page images has become an important technique for physical document layout analysis. Globally maximal whitespace rectangles have been previously demonstrated to constitute a concise representation of the major layout features of documents. However, previous methods for computing maximal whitespace rectangles were limited to axis-aligned re...
Page Classification for Meta-data Extraction from Digital Collections
Automatic extraction of meta-data from collections of scanned documents (books and journals) is a useful task in order to increase the accessibility of these digital collections. In order to improve the extraction of meta-data, the classification of the page layout into a set of pre-defined classes can be helpful. In this paper we describe a method for classifying document images on the basis o...
Document page similarity based on layout visual saliency: application to query by example and document classificat - Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on
In this paper we propose to define a measure of visual similarity to compare different pages in a corpus. This measure is based on the analysis of the visual layout saliency of the page composition. This similarity is computed using both the document layout and characteristics of the text itself. The text characterization uses statistical features derived from textural primitives. Our purpose i...
STAN: Structural Analysis for Web Documents
In this paper we present STAN, a structural analysis tool used for classifying web documents while at the same time extracting meaningful information from them. The extraction and classification rules are defined in terms of a structural grammar operating on both layout properties and content properties of the document. STAN was designed to accept HTML as input and is able to process documents...
Publication date: 2000